65 research outputs found

    Performance analysis of a generalized upset detection procedure

    A general procedure for upset detection in complex systems, called the data block capture and analysis upset monitoring process, is described and analyzed. The process consists of repeatedly recording a fixed amount of data from a set of predetermined observation lines of the system being monitored (i.e., capturing a block of data) and then analyzing the captured block to determine whether the system is functioning correctly. The algorithm that analyzes the data blocks can be characterized by the amount of time it requires to examine a data block of given length for the features or conditions that have been predetermined to characterize the upset-free behavior of the system. The performance of linear, quadratic, and logarithmic data analysis algorithms is rigorously characterized in terms of three performance measures: (1) the probability of correctly detecting an upset; (2) the expected number of false alarms; and (3) the expected latency in detecting upsets.
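    The capture-then-analyze loop can be sketched as follows. This is a minimal illustration, not the paper's model: it captures consecutive blocks from an iterable, does not model the analysis time or any gap between captures, and the names `monitor_blocks` and `within_bound` (a hypothetical linear-time upset-free condition) are assumptions made for the example.

```python
import itertools
from typing import Callable, Iterable, List, Sequence


def monitor_blocks(
    observations: Iterable[int],
    block_size: int,
    is_upset_free: Callable[[Sequence[int]], bool],
) -> List[int]:
    """Capture consecutive fixed-size blocks and return the indices of blocks
    whose analysis reports an upset. A trailing partial block is ignored."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    stream = iter(observations)  # islice must consume the input, not restart it
    flagged: List[int] = []
    block_index = 0
    while True:
        # Capture phase: record a fixed amount of data from the observation lines.
        block = list(itertools.islice(stream, block_size))
        if len(block) < block_size:
            break
        # Analysis phase: check the predetermined upset-free conditions.
        if not is_upset_free(block):
            flagged.append(block_index)
        block_index += 1
    return flagged


def within_bound(block: Sequence[int]) -> bool:
    """Hypothetical upset-free condition: every sample stays below 50."""
    return all(sample < 50 for sample in block)


print(monitor_blocks([0, 1, 2, 3, 4, 5, 99, 7, 8, 9, 10, 11], 4, within_bound))  # [1]
```

    Swapping `within_bound` for a quadratic or logarithmic check changes only the analysis cost per block, which is the quantity the abstract's performance measures depend on.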

    Transient Faults in Computer Systems

    A powerful technique, particularly appropriate for detecting errors caused by transient faults in computer systems, was developed. The technique can be implemented in either software or hardware; the research conducted thus far has primarily considered software implementations. The error detection technique has the distinct advantage of provably complete coverage of all errors caused by transient faults that affect the output produced by the execution of a program. In other words, the technique does not have to be tuned to a particular error model to enhance error coverage. Also, the correctness of the technique can be formally verified. The technique uses time and software redundancy. The foundation for an effective, low-overhead, software-based certification trail approach to the real-time detection of errors resulting from transient faults was developed.

    Efficient diagnosis of multiprocessor systems under probabilistic models

    The problem of fault diagnosis in multiprocessor systems is considered under a probabilistic fault model. The focus is on minimizing the number of tests that must be conducted in order to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm is presented that, for a class of systems, correctly diagnoses the state of every processor with probability approaching one while performing only slightly more than a linear number of tests. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is also proven. Lower and upper bounds on the number of tests required for regular systems are also presented. A class of regular systems that includes hypercubes is shown to be correctly diagnosable with high probability. In all cases, the number of tests required under this probabilistic model is shown to be significantly smaller than under a bounded-size fault set model. Because the number of tests that must be conducted is a measure of the diagnosis overhead, these results represent a dramatic improvement in the performance of system-level diagnosis techniques.
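    The abstract does not spell out its diagnosis algorithm, so the sketch below swaps in a simpler, well-known idea: each unit is tested by k other units and is declared faulty when a strict majority of its testers say so. It assumes PMC-style test semantics (a fault-free tester reports the truth, a faulty tester is unreliable; here it simply lies) and a hypothetical assignment in which unit u is tested by units u+1, ..., u+k modulo n. Majority voting is correct whenever fewer than half of each unit's testers are faulty; with independent faults of probability below 1/2, a Chernoff-plus-union-bound argument suggests k on the order of log n testers per unit suffices, which is not the paper's near-linear result.

```python
from typing import Dict, Set, Tuple

TestResults = Dict[Tuple[int, int], bool]  # (tester, tested) -> True means "tested unit is faulty"


def run_tests(n: int, k: int, faulty: Set[int]) -> TestResults:
    """Unit u is tested by units u+1, ..., u+k (mod n): n*k tests in total.
    Requires k < n so that no unit tests itself. A fault-free tester reports
    the truth; a faulty tester is unreliable and here reports the opposite."""
    results: TestResults = {}
    for u in range(n):
        truth = u in faulty
        for j in range(1, k + 1):
            tester = (u + j) % n
            results[(tester, u)] = truth if tester not in faulty else not truth
    return results


def diagnose(n: int, k: int, results: TestResults) -> Set[int]:
    """Declare a unit faulty when a strict majority of its k testers say so."""
    diagnosed: Set[int] = set()
    for u in range(n):
        votes = sum(results[((u + j) % n, u)] for j in range(1, k + 1))
        if votes > k // 2:
            diagnosed.add(u)
    return diagnosed


results = run_tests(10, 5, {0, 5})
print(len(results), sorted(diagnose(10, 5, results)))  # 50 [0, 5]
```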

    Method and apparatus for fault tolerance

    A method and apparatus for achieving fault tolerance in a computer system having at least a first central processing unit and a second central processing unit. The method comprises the following steps. First, a first algorithm is executed in the first central processing unit on the input, producing a first output as well as a certification trail. Next, a second algorithm is executed in the second central processing unit on the input and on at least a portion of the certification trail, producing a second output. The second algorithm has a faster execution time than the first algorithm for a given input. Then, the first and second outputs are compared, and an error result is produced if they are not the same. The steps of executing the first algorithm and executing the second algorithm preferably take place over essentially the same time period.
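    The three steps can be sketched for sorting. This is a minimal illustration under assumptions not stated in the abstract: the trail is taken to be the list of input indices in sorted order, both algorithms run sequentially on one processor rather than on two CPUs over the same time period, and all function names are hypothetical. The key property is that `second_algorithm` never trusts the trail: it checks that the trail is a permutation and that it yields sorted order, so it returns either a correct output or an error indication.

```python
from typing import List, Optional, Tuple


def first_algorithm(xs: List[int]) -> Tuple[List[int], List[int]]:
    """O(n log n) sort. Trail: the input indices listed in sorted order."""
    trail = sorted(range(len(xs)), key=lambda i: xs[i])
    return [xs[i] for i in trail], trail


def second_algorithm(xs: List[int], trail: List[int]) -> Optional[List[int]]:
    """O(n) pass that uses the trail but does not trust it: it returns a
    correctly sorted output, or None (an error indication) if the trail is bad."""
    n = len(xs)
    if len(trail) != n:
        return None
    seen = [False] * n
    for i in trail:  # the trail must be a permutation of 0..n-1
        if not 0 <= i < n or seen[i]:
            return None
        seen[i] = True
    out = [xs[i] for i in trail]
    if any(out[j] > out[j + 1] for j in range(n - 1)):  # and must yield sorted order
        return None
    return out


def certified_sort(xs: List[int]) -> List[int]:
    """Run both algorithms and compare their outputs."""
    first_output, trail = first_algorithm(xs)
    second_output = second_algorithm(xs, trail)
    if second_output is None or first_output != second_output:
        raise RuntimeError("error detected: trail rejected or outputs differ")
    return first_output


print(certified_sort([3, 1, 2]))  # [1, 2, 3]
```

    The second pass avoids comparison sorting entirely, which is why it can be faster than the first for the same input.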

    Certification trails for data structures

    Certification trails are a recently introduced and promising approach to fault detection and fault tolerance. The applicability of the certification trail technique is significantly generalized. Previously, certification trails had to be customized to each algorithm application; here, trails appropriate to wide classes of algorithms were developed. These certification trails are based on common data-structure operations, such as those carried out using balanced binary trees and heaps. Any algorithm using these sets of operations can therefore employ the certification trail method to achieve software fault tolerance. To exemplify the scope of this generalization of the certification trail technique, constructions of trails for abstract data types such as priority queues and union-find structures are given. These trails are applicable to any data-structure implementation of the abstract data type. It is also shown that these ideas lead naturally to monitors for data-structure operations.

    Using certification trails to achieve software fault tolerance

    A conceptually novel and powerful technique to achieve fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared and, if they agree, are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output, even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance was formalized and illustrated by applying it to the fundamental problem of finding a minimum spanning tree. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed. The certification trail approach was compared to other approaches to fault tolerance. Because of space limitations we have omitted examples of our technique applied to the Huffman tree and convex hull problems. These can be found in the full version of this paper.
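    The abstract does not describe its minimum spanning tree trail, so the sketch below uses a hypothetical one chosen for brevity, which may well differ from the paper's: the first phase runs Kruskal's algorithm and records the sorted edge order as the trail. The second phase checks in O(m) that the trail is a permutation of the edges in nondecreasing weight order, then reruns only Kruskal's union-find scan, skipping the O(m log m) sort. Any valid sorted order yields a minimum spanning tree (a spanning forest if the graph is disconnected), so a corrupted trail produces either an error indication or a correct tree.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int, int]  # (u, v, weight)


class DisjointSets:
    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def kruskal_scan(n: int, ordered: List[Edge]) -> List[Edge]:
    """Kruskal's greedy pass over edges already in nondecreasing weight order."""
    sets = DisjointSets(n)
    tree: List[Edge] = []
    for u, v, w in ordered:
        if sets.union(u, v):  # keep the edge only if it joins two components
            tree.append((u, v, w))
    return tree


def first_phase(n: int, edges: List[Edge]) -> Tuple[List[Edge], List[int]]:
    """Full Kruskal: sort, then scan. Trail: the sorted edge order."""
    trail = sorted(range(len(edges)), key=lambda i: edges[i][2])
    return kruskal_scan(n, [edges[i] for i in trail]), trail


def second_phase(n: int, edges: List[Edge], trail: List[int]) -> Optional[List[Edge]]:
    """Validate the trail in O(m), then rerun only the scan; None signals an error."""
    m = len(edges)
    if len(trail) != m:
        return None
    seen = [False] * m
    for i in trail:  # the trail must be a permutation of the edge indices
        if not 0 <= i < m or seen[i]:
            return None
        seen[i] = True
    ordered = [edges[i] for i in trail]
    if any(ordered[j][2] > ordered[j + 1][2] for j in range(m - 1)):
        return None  # the trail is not in nondecreasing weight order
    return kruskal_scan(n, ordered)


def certified_mst(n: int, edges: List[Edge]) -> List[Edge]:
    tree, trail = first_phase(n, edges)
    if second_phase(n, edges, trail) != tree:
        raise RuntimeError("error detected")
    return tree


edges = [(0, 1, 4), (1, 2, 1), (2, 3, 3), (0, 3, 2), (0, 2, 5)]
print(certified_mst(4, edges))  # [(1, 2, 1), (0, 3, 2), (2, 3, 3)]
```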

    Experimental evaluation of certification trails using abstract data type validation

    Certification trails are a recently introduced and promising approach to fault detection and fault tolerance. Recent experimental work reveals many cases in which a certification trail approach allows significantly faster program execution than a basic time-redundancy approach. Algorithms for answer validation of abstract data types allow a certification trail approach to be used for a wide variety of problems. An experimental assessment of the performance of algorithms utilizing certification trails on abstract data types is reported. Specifically, this method was applied to the following problems: heapsort, Huffman tree, shortest path, and skyline. Previous results used certification trails specific to a particular problem and implementation. This approach allows certification trails to be localized to 'data structure modules,' making the use of the technique transparent to the user of such modules.

    Certification of computational results

    A conceptually novel and powerful technique to achieve fault detection and fault tolerance in hardware and software systems is described. When used for software fault detection, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared and, if they agree, are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output, even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance is formalized, and realizations of it are illustrated by considering algorithms for the following problems: convex hull, sorting, and shortest path. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed. The certification trail approach is compared to other approaches to fault tolerance.
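    For the shortest path problem, one possible realization (a hypothetical trail, not necessarily the one in the paper) is to let the first phase run Dijkstra's algorithm and leave behind a parent pointer for every vertex. The second phase then validates the claimed distances in O(n + m): no edge may be able to shorten a distance, and every reachable vertex other than the source must be justified by its parent edge with exactly matching length. The sketch assumes strictly positive integer weights; with zero-weight edges, parent pointers could form cycles and the justification argument would fail.

```python
import heapq
from typing import List, Optional, Tuple

Graph = List[List[Tuple[int, int]]]  # graph[u] = [(v, w), ...] with integer w > 0
INF = float("inf")


def dijkstra_with_trail(graph: Graph, s: int) -> Tuple[List[float], List[int]]:
    """First phase: Dijkstra. Output: distances. Trail: a parent for each vertex."""
    dist: List[float] = [INF] * len(graph)
    parent = [-1] * len(graph)
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (dist[v], v))
    return dist, parent


def check_distances(graph: Graph, s: int, dist: List[float], parent: List[int]) -> Optional[List[float]]:
    """Second phase, O(n + m): return dist if it is provably correct, else None."""
    n = len(graph)
    if len(dist) != n or len(parent) != n or dist[s] != 0:
        return None
    if any(not (0 <= d <= INF) for d in dist):  # rejects negatives and NaN
        return None
    justified = [False] * n
    for u in range(n):
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                return None  # this edge could still shorten dist[v]
            if parent[v] == u and dist[u] + w == dist[v]:
                justified[v] = True  # dist[v] is realized by the edge (u, v)
    for v in range(n):
        if v != s and dist[v] != INF and not justified[v]:
            return None
    return dist


graph: Graph = [[(1, 1), (2, 4)], [(2, 2)], [(3, 1)], []]
print(check_distances(graph, 0, *dijkstra_with_trail(graph, 0)))  # [0, 1, 3, 4]
```

    Because weights are positive, following justified parent pointers strictly decreases the distance, so every chain ends at the source and each accepted value is the length of a real path; the edge check then rules out any shorter path.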

    Certification trails and software design for testability

    Design techniques that may be applied to make program testing easier were investigated. Methods are presented for modifying a program to generate additional data, which we refer to as a certification trail. This additional data is designed to allow the program output to be checked more quickly and effectively. Previously, certification trails had been described primarily from a theoretical perspective. A comprehensive attempt to assess experimentally the performance and overall value of the certification trail method is reported. The method was applied to nine fundamental, well-known algorithms for the following problems: convex hull, sorting, Huffman tree, shortest path, closest pair, line segment intersection, longest increasing subsequence, skyline, and Voronoi diagram. Run-time performance data for each of these problems is given, and selected problems are described in more detail. Our results indicate that there are many cases in which certification trails allow for significantly faster overall program execution time than a 2-version programming approach, and they also give further evidence of the breadth of applicability of this method.

    Generalizations of Yang-Mills Theory with Nonlinear Constitutive Equations

    We generalize classical Yang-Mills theory by extending nonlinear constitutive equations for Maxwell fields to non-Abelian gauge groups. Such theories may or may not be Lagrangian. We obtain conditions on the constitutive equations specifying the Lagrangian case, of which recently discussed non-Abelian Born-Infeld theories are particular examples. Some models in our class possess nontrivial Galilean (c goes to infinity) limits; we determine when such limits exist and obtain them explicitly. Comment: Submitted to the Proceedings of the 3rd Symposium on Quantum Theory and Symmetries (QTS3), 10-14 September 2003. Preprint, 9 pages including references.